feat(lumera): add LEP-6 chain client extensions#286
The new LEP-6 chain-driven self-healing runtime (dispatcher, healer, verifier, finalizer, transport handler, SQLite dedup, cascade staging/publish, peer client) is well-structured, thoroughly tested, and cleanly integrated. The previous concurrency concern has been addressed.
Introduces pkg/storagechallenge/deterministic/lep6.go, the off-chain
computation library shared by the storage_challenge runtime, recheck
service, and self-healing dispatcher. Every function is pure (no I/O,
no clock, no goroutines) so independent reporters challenging the same
(target, ticket) pair produce byte-identical StorageProofResult fields.
Functions land in two categories:
CHAIN-MIRRORED (must match lumera/x/audit/v1/keeper/audit_peer_assignment.go
byte-for-byte; the chain re-runs them to validate MsgSubmitEpochReport):
- SelectLEP6Targets — 1/3 deterministic target subset
(SHA-256(seed||0x00||account||0x00||"challenge_target"),
targetCount = ceil(N/divisor) clamped to [1, N])
- PairChallengerToTarget / AssignChallengerTargets — challenger->target
pairing (label "pair"), with no-self and lex tie-break
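The chain-mirrored target selection can be sketched as below. The hash composition (SHA-256 of seed || 0x00 || account || 0x00 || "challenge_target") and the ceil(N/divisor) clamp come from the description above; the ordering rule (lowest digest first, with a lexicographic tie-break) and the function names are illustrative assumptions, not the chain's actual code.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"sort"
)

// rank mirrors the documented derivation:
// SHA-256(seed || 0x00 || account || 0x00 || "challenge_target").
func rank(seed []byte, account string) []byte {
	h := sha256.New()
	h.Write(seed)
	h.Write([]byte{0x00})
	h.Write([]byte(account))
	h.Write([]byte{0x00})
	h.Write([]byte("challenge_target"))
	return h.Sum(nil)
}

// selectTargets picks ceil(N/divisor) accounts, clamped to [1, N].
// Ordering by lowest rank first is an assumption for illustration.
func selectTargets(active []string, seed []byte, divisor int) []string {
	n := len(active)
	if n == 0 {
		return nil
	}
	count := (n + divisor - 1) / divisor // ceil(N/divisor)
	if count < 1 {
		count = 1
	}
	if count > n {
		count = n
	}
	sorted := append([]string(nil), active...)
	sort.Slice(sorted, func(i, j int) bool {
		ri, rj := rank(seed, sorted[i]), rank(seed, sorted[j])
		if c := bytes.Compare(ri, rj); c != 0 {
			return c < 0
		}
		return sorted[i] < sorted[j] // lex tie-break
	})
	return sorted[:count]
}

func main() {
	seed := []byte("01234567890123456789012345678901")
	fmt.Println(selectTargets([]string{"sn-a", "sn-b", "sn-c", "sn-d", "sn-e", "sn-f"}, seed, 3))
}
```

Because the ranking depends only on (seed, account), every reporter computes the same subset regardless of the order in which it enumerates the active set.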
SUPERNODE-CANONICAL (chain stores outputs as opaque strings; this file
defines the canonical encoding all reporters must use to stay in lockstep):
- ClassifyTicketBucket — RECENT/OLD bucket classification using
Action.BlockHeight (Action.UpdatedHeight does not exist; see
docs/plans/LEP6_SUPERNODE_IMPLEMENTATION_PLAN.md "Resolved Decision 3")
- SelectTicketForBucket — deterministic per-(target,bucket) ticket pick
with excluded-set support for active heal ops
- SelectArtifactClass — LEP-6 §10 weighted roll (20% INDEX / 80% SYMBOL)
with deterministic fallback when a class has no artifacts
- SelectArtifactOrdinal — uniform ordinal mod artifactCount
- ComputeMultiRangeOffsets — k=4 range offsets in [0, size-rangeLen)
- ComputeCompoundChallengeHash — BLAKE3 over concat of slices in offset
order (lukechampine.com/blake3 to match the chain's library)
- DerivationInputHash — canonical hex of derivation inputs
- TranscriptHash — full canonical transcript identifier with sorted
observer ids; struct-input form prevents field-order mistakes
Domain separators ("challenge_target", "pair", "ticket_rank",
"artifact_class", "artifact_ordinal", "range_offset",
"derivation_input", "transcript") and enum string forms ("INDEX"/
"SYMBOL", "RECENT"/"OLD"/"PROBATION"/"RECHECK") are package
constants; freezing them prevents accidental drift between callers and
tests. Any change is a protocol-level break that requires versioning.
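The domain-separated style above can be illustrated with the §10 weighted class roll. Only the "artifact_class" separator, the 20% INDEX / 80% SYMBOL split, and the deterministic fallback are from the text; the exact input composition, the mod-100 draw, and the function names are assumptions for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// roll derives a uniform value from a domain-separated SHA-256 draw.
// The input composition here is an assumption; the "artifact_class"
// separator and the 20/80 split come from the description above.
func roll(seed []byte, ticketID string) uint64 {
	h := sha256.New()
	h.Write(seed)
	h.Write([]byte{0x00})
	h.Write([]byte(ticketID))
	h.Write([]byte{0x00})
	h.Write([]byte("artifact_class"))
	return binary.BigEndian.Uint64(h.Sum(nil)[:8])
}

// selectClass rolls INDEX with 20% weight and falls back deterministically
// to the other class when the rolled class has no artifacts.
func selectClass(seed []byte, ticketID string, indexCount, symbolCount uint32) string {
	class := "SYMBOL"
	if roll(seed, ticketID)%100 < 20 {
		class = "INDEX"
	}
	if class == "INDEX" && indexCount == 0 {
		class = "SYMBOL"
	} else if class == "SYMBOL" && symbolCount == 0 {
		class = "INDEX"
	}
	return class
}

func main() {
	fmt.Println(selectClass([]byte("seed"), "ticket-1", 4, 120))
}
```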
Tests:
- TestStorageTruthAssignmentHash_KnownVector locks the byte-level SHA-256
composition against an independent computation, guaranteeing the
chain-mirrored helper has not drifted.
- TestSelectLEP6Targets_OneThirdCoverage_AssignmentMatchesChain uses the
chain's own audit_peer_assignment_test.go fixture
(seed="01234567890123456789012345678901", active={sn-a..sn-f},
divisor=3) — output {sn-f, sn-e}.
- TestAssignChallengerTargets_KnownAssignment locks the full pairing
{sn-a -> sn-f, sn-b -> sn-e}.
- TestSelectArtifactClass_WeightedDistribution validates ~20% INDEX
over 5000 draws (±2% tolerance).
- Determinism, sensitivity, error-path, and out-of-bounds tests for
every primitive.
Verified: `go test ./pkg/storagechallenge/deterministic/...` passes; the
existing deterministic_test.go pre-LEP-6 tests continue to pass.
…#288) Implements the PR3 compound storage challenge runtime on top of the latest LEP-6 PR1 foundation branch after PR2 was merged into it.

Highlights:
- Add compound proof request/response fields and regenerated supernode proto bindings.
- Add recipient-side GetCompoundProof handler with signed responses and range validation.
- Add challenger-side LEP6Dispatcher for assigned-target dispatch across RECENT/OLD buckets.
- Add result buffer implementing host_reporter ProofResultProvider with deterministic chain-cap throttling.
- Add deterministic cascade metadata resolution helpers for artifact count, key, and exact artifact size.
- Add production ChainTicketProvider backed by final Lumera x/action ListActionsBySuperNode query.
- Wire startup to use ChainTicketProvider and cascade metadata/action size resolution instead of NoTicketProvider.
- Classify target RPC timeout/no-response as TIMEOUT_OR_NO_RESPONSE and malformed transcripts as INVALID_TRANSCRIPT.
- Extend action module bindings/mocks with ListActionsBySuperNode.
- Preserve PR1 provider concurrency hardening and PR2 deterministic roocode fixes after rebase.

Lumera dependency/source:
- github.com/LumeraProtocol/lumera v1.12.0
- chain source: lumera/master 451f8a8e7ff30b3370cba59fab8e6228473a348b

Validation:
- git diff --check origin/supernode/LEP-6-chain-client-extensions..HEAD: pass
- go test ./pkg/storagechallenge/... ./supernode/storage_challenge ./supernode/transport/grpc/storage_challenge ./supernode/host_reporter ./pkg/lumera/modules/action ./pkg/lumera/modules/audit ./pkg/lumera/modules/audit_msg -count=1 -v: pass
- go vet ./pkg/storagechallenge/... ./supernode/storage_challenge ./supernode/transport/grpc/storage_challenge ./supernode/host_reporter ./pkg/lumera/modules/action ./pkg/lumera/modules/audit ./pkg/lumera/modules/audit_msg: pass
- go test ./... -count=1: pass

Plan: docs/plans/LEP6_SUPERNODE_IMPLEMENTATION_PLAN_v3_MASTER.md PR3
…289) Replaces the gonode-era peer-watchlist self-healing with a chain-mediated LEP-6 §18-§22 (Workstream C) implementation. Healer reconstructs locally and STAGES (no KAD publish), verifiers fetch reconstructed bytes from the assigned healer over a streaming gRPC RPC (§19 healer-served path) and hash-compare against op.ResultHash, then publish to KAD only after chain VERIFIED quorum.

Three-phase flow:

Phase 1 — RECONSTRUCT (no publish): cascade.RecoveryReseed(PersistArtifacts=false, StagingDir) → download remaining symbols → RaptorQ-decode → verify file hash against Action.DataHash → re-encode → stage symbols+idFiles+layout+reconstructed.bin to ~/.supernode/heal-staging/<op_id>/. Submit MsgClaimHealComplete{HealManifestHash}; chain transitions SCHEDULED → HEALER_REPORTED and sets op.ResultHash = HealManifestHash.

Phase 2 — VERIFY (§19 healer-served path): verifier opens supernode.SelfHealingService/ServeReconstructedArtefacts on the assigned healer (op.HealerSupernodeAccount), streams the reconstructed bytes, computes BLAKE3 base64 (the Action.DataHash recipe via cascadekit.ComputeBlake3DataHashB64), compares against op.ResultHash (NOT Action.DataHash — chain enforces this at lumera/x/audit/v1/keeper/msg_storage_truth.go:291), and submits MsgSubmitHealVerification{verified, hash}. Chain quorum is n/2+1.

Phase 3 — PUBLISH (only on VERIFIED): finalizer polls heal_claims_submitted (Opt 2b per-op poll, folded into the single tick loop alongside healer + verifier dispatch), reads op.Status, calls cascade.PublishStagedArtefacts on VERIFIED (same storeArtefacts path as register/upload), and deletes staging on FAILED/EXPIRED. Chain may reschedule a different healer on EXPIRED.

Crash-recovery / restart-safety: submit-then-persist ordering. The SQLite dedup row is written ONLY after the chain has accepted the tx. A failed submit (mempool, signing, chain reject) leaves no row and staging is removed, so the next tick can retry cleanly.
If the chain accepted a prior submit but the supernode crashed before persisting, the next tick's resubmit fails with "does not accept healer completion claim" and reconcileExistingClaim re-fetches the heal-op, confirms the chain ResultHash equals our manifest, and persists the dedup row so the finalizer takes over.

Negative-attestation hash: the chain rejects an empty VerificationHash even on verified=false (msg_storage_truth.go:271-273). The verifier synthesizes a deterministic non-empty placeholder (sha256("lep6:negative-attestation:"+reason) base64) on the fetch_failed and hash_compute_failed paths. The chain only validates VerificationHash content for positive votes (msg_storage_truth.go:288-294), so any non-empty value is well-formed for negatives.

Components added:

supernode/self_healing/
- service.go: single tick loop; mode gate (UNSPECIFIED skips); healer dispatch; verifier dispatch; finalizer poll; sync.Map in-flight + buffered semaphores (reconstructs=2, verifications=4, publishes=2).
- healer.go: Phase 1; submit-then-persist ordering; reconcileExistingClaim handles post-crash recovery when the chain accepted a prior submit.
- verifier.go: Phase 2; fetches from the assigned healer with exponential-backoff retry (3 attempts), submits verified=false with a non-empty placeholder hash on persistent fetch failure; the positive path hash-compares against op.ResultHash; reconciles chain-side "verification already submitted" idempotency.
- finalizer.go: Phase 3; VERIFIED → publish + cleanup; FAILED/EXPIRED → cleanup only; transient states no-op.
- peer_client.go: secureVerifierFetcher dials via the same secure-rpc / lumeraid stack the legacy storage_challenge loop uses.

supernode/transport/grpc/self_healing/handler.go: streaming ServeReconstructedArtefacts RPC. DefaultCallerIdentityResolver pulls the verifier identity from the secure-rpc (Lumera ALTS) handshake via pkg/reachability.GrpcRemoteIdentityAndAddr; production wiring uses this so req.VerifierAccount is never trusted alone.
Authorizes caller ∈ op.VerifierSupernodeAccounts AND identity == op.HealerSupernodeAccount; refuses with FailedPrecondition when not the assigned healer and PermissionDenied for unassigned callers. 1 MiB chunks.

proto/supernode/self_healing.proto: SelfHealingService { ServeReconstructedArtefacts streams chunks }. Makefile gen-supernode wires it; gen/supernode/self_healing*.pb.go regenerated.

supernode/cascade/reseed.go: splits RecoveryReseed into PersistArtifacts=true (legacy/republish) vs PersistArtifacts=false (LEP-6 stage-only). Adds stageArtefacts + PublishStagedArtefacts. Stages the reconstructed file bytes and a JSON manifest the §19 transport reads.

supernode/cascade/staged.go: ReadStagedHealOp helper used by the transport handler.

supernode/cascade/interfaces.go: the CascadeTask interface gains RecoveryReseed + PublishStagedArtefacts so self_healing depends only on the factory abstraction.

pkg/storage/queries/self_healing_lep6.go: tables heal_claims_submitted (PK heal_op_id) and heal_verifications_submitted (PK (heal_op_id, verifier_account)) for restart dedup. Typed sentinel errors ErrLEP6ClaimAlreadyRecorded / ErrLEP6VerificationAlreadyRecorded. Migrations wired in OpenHistoryDB.

pkg/storage/queries/local.go: LocalStoreInterface embeds LEP6HealQueries.

supernode/config/config.go: SelfHealingConfig YAML block (enabled, poll_interval_ms, max_concurrent_*, staging_dir, verifier_fetch_timeout_ms, verifier_fetch_attempts). Disabled by default until activation.

supernode/cmd/start.go: constructs selfHealingService.Service + selfHealingRPC.Server (with DefaultCallerIdentityResolver) when SelfHealingConfig.Enabled, registers SelfHealingService_ServiceDesc on the gRPC server, and appends the runner to the lifecycle services list. Reuses cService (cascade factory) and historyStore.

Tests (16 mandatory; all PASS):

supernode/self_healing/service_test.go
1. TestVerifier_ReadsOpResultHashForComparison (R-bug pin)
2. TestVerifier_HashMismatchProducesVerifiedFalse
2b. TestVerifier_FetchFailureSubmitsNonEmptyHash (BLOCKER pin)
3. TestVerifier_FetchesFromAssignedHealerOnly (§19 gate)
6. TestHealer_FailedSubmitDoesNotPersistDedupRow (ordering)
6b. TestHealer_ReconcilesExistingChainClaimAfterCrash (recovery)
7. TestHealer_RaptorQReconstructionFailureSkipsClaim (Scenario C1)
8. TestFinalizer_VerifiedTriggersPublishToKAD (Scenario A)
9. TestFinalizer_FailedSkipsPublish_DeletesStaging (Scenario B)
10. TestFinalizer_ExpiredSkipsPublish_DeletesStaging (Scenario C2)
11. TestService_NoRoleSkipsOp
12. TestService_UnspecifiedModeSkipsEntirely (mode gate)
13. TestService_FinalStateOpsIgnored
14. TestDedup_RestartDoesNotResubmit (3-layer dedup)

supernode/transport/grpc/self_healing/handler_test.go
4. TestServeReconstructedArtefacts_AuthorizesOnlyAssignedVerifiers
5. TestServeReconstructedArtefacts_RejectsUnassignedCaller (also covers the non-assigned-healer FailedPrecondition refusal)

pkg/storage/queries/self_healing_lep6_test.go
TestLEP6_HealClaim_RoundTripAndDedup
TestLEP6_HealVerification_PerVerifierDedup

Validation:
- go test ./supernode/self_healing/... PASS (2.66s)
- go test ./supernode/transport/grpc/self_healing/... PASS (0.09s)
- go test ./supernode/cascade/... PASS (0.09s)
- go test ./pkg/storage/queries/... PASS (0.20s)
- go test ./pkg/storagechallenge/... ./supernode/storage_challenge ./supernode/host_reporter ./pkg/lumera/modules/audit ./pkg/lumera/modules/audit_msg PASS
- go vet (touched + all transitively reachable pkgs) PASS
- go build (targeted) PASS (the full-repo go build fails only on a pre-existing github.com/kolesa-team/go-webp libwebp-dev system-header issue, unrelated to this change)

Resolved decisions applied:
✓ Branch base: PR-3 tip f79f88f, NOT self-healing-improvements (single chain-driven service per Bilal direction; legacy 3-way Request/Verify/Commit RPC discarded).
✓ Verifier compares against op.ResultHash (chain msg_storage_truth.go:291). Pinned by TestVerifier_ReadsOpResultHashForComparison.
✓ Hash recipe = cascadekit.ComputeBlake3DataHashB64 (the Action.DataHash recipe). The same recipe is enforced by healer, verifier, and chain.
✓ KAD publish AFTER chain VERIFIED (§19 healer-served-path gate); the staging directory is the only authority before quorum.
✓ Finalizer mechanism: Opt 2b (per-op GetHealOp poll, folded into the single tick loop) — no Tendermint WS, no monotonic-growth poll.
✓ Concurrency defaults: semaphore = 2 reconstructs (RaptorQ RAM-aware), 4 verifications, 2 publishes.
✓ Mode gate: UNSPECIFIED skips the dispatcher entirely (Service.tick early-return; verified by TestService_UnspecifiedModeSkipsEntirely).
✓ Three-layer dedup: sync.Map + bounded semaphores + SQLite (heal_claims_submitted + heal_verifications_submitted).
✓ Submit-then-persist ordering with a reconcile path for crash recovery.
✓ Non-empty placeholder VerificationHash on negative attestations (the chain rejects empty regardless of the verified bool).
✓ Caller authentication via the secure-rpc / Lumera ALTS handshake at the transport layer; req.VerifierAccount is never trusted alone in production.

Plan: docs/plans/LEP6_PR4_EXECUTION_PLAN.md
Implements the PR-5 Supernode side of LEP-6 storage-truth recheck evidence on top of the PR-4 heal-op dispatch branch.

Public surfaces added:
- supernode/recheck: Candidate, RecheckResult, Finder, Attestor, Service, ReporterSource, SupernodeReporterSource, eligibility and outcome mapping helpers.
- pkg/storage/queries: RecheckQueries plus SQLite-backed HasRecheckSubmission and RecordRecheckSubmission.
- pkg/lumera/modules/audit: GetEpochReportsByReporter query wrapper for network-wide candidate discovery.
- supernode/storage_challenge: LEP6Dispatcher.Recheck to execute RECHECK-bucket proofs without adding results to epoch reports.

Spec/chain alignment decisions:
- Candidate discovery is network-wide: the service lists registered supernodes and scans EpochReportsByReporter over the configured lookback window, rather than only scanning this node's own report.
- Recheck candidate eligibility mirrors chain storage transcript records: only HASH_MISMATCH, TIMEOUT_OR_NO_RESPONSE, OBSERVER_QUORUM_FAIL, and INVALID_TRANSCRIPT originals are eligible.
- The service rejects self-target candidates and self-reported challenged results because chain SubmitStorageRecheckEvidence rejects creator == challenged_supernode_account and creator == challenged result reporter.
- Recheck execution maps local PASS to PASS and confirmed hash mismatch to RECHECK_CONFIRMED_FAIL; timeout/quorum/invalid-transcript classes remain explicit and are not collapsed.
- Recheck execution reuses the PR-3 compound dispatcher in RECHECK bucket mode with an isolated temporary buffer so recheck results are submitted only through MsgSubmitStorageRecheckEvidence and are never included in host epoch reports.
- Local dedup is submit-then-persist keyed by epoch_id + ticket_id (creator/self is implicit locally); a tx hard-fail does not persist, while chain replay/already-submitted errors persist local dedup for idempotence.
- Startup/config wiring is additive under storage_challenge.lep6.recheck and remains disabled unless explicitly enabled.

Tests added/updated:
- Eligibility matrix for all eligible and rejected result classes.
- Outcome mapping for PASS, RECHECK_CONFIRMED_FAIL, timeout, quorum, and invalid transcript.
- Finder lookback/order/limit/local-dedup behavior.
- Network-wide reporter discovery regression so peer-reported failures are discovered, not only self-reported ones.
- Self-target and self-reported candidate rejection pinned against chain validation.
- Service mode gate and submit path.
- Attestor submit-then-persist, tx hard-fail retry safety, idempotent already-submitted handling, and required-field rejection.
- SQLite recheck submission idempotence/dedup preservation.
- Dispatcher RECHECK execution path integration through focused package tests.

Validation:
- PATH=/home/openclaw/.local/go/bin:$PATH go test ./supernode/recheck ./pkg/storage/queries ./supernode/storage_challenge ./supernode/cmd ./pkg/lumera/modules/audit => PASS
- PATH=/home/openclaw/.local/go/bin:$PATH go test ./supernode/host_reporter ./supernode/self_healing ./supernode/transport/grpc/self_healing ./supernode/recheck ./pkg/storage/queries ./supernode/storage_challenge ./supernode/cmd ./pkg/lumera/modules/audit => PASS
- PATH=/home/openclaw/.local/go/bin:$PATH go vet ./supernode/recheck ./pkg/storage/queries ./supernode/storage_challenge ./supernode/cmd ./pkg/lumera/modules/audit ./supernode/host_reporter ./supernode/self_healing ./supernode/transport/grpc/self_healing => PASS
- git diff --check => PASS
- PATH=/home/openclaw/.local/go/bin:$PATH go test ./... => expected local environment failure only in pkg/storage/files due to missing go-webp system headers webp/decode.h and webp/encode.h; other visible packages pass.

Parent: supernode/LEP-6-heal-op-dispatch @ 043fba4.
mateeullahmalik
left a comment
Production-Gate Review — PR #286 feat(lumera): add LEP-6 chain client extensions
Reviewed at head 96764d8. 108 files / +15.8k / −903. Three parallel deep passes across (a) self-healing, (b) storage-challenge dispatch + recheck, (c) chain-client / config / observability.
I'm flagging only real bugs / safety / consensus-or-fee-flow issues — no nits, no style.
🔴 CRITICAL
C1. applyLEP6DefaultsAndValidate silently auto-opts every operator into LEP-6 on upgrade
supernode/config/lep6.go:80-113
```go
if !c.StorageChallengeConfig.LEP6.enabledSet { c.StorageChallengeConfig.LEP6.Enabled = true }
if !recheck.enabledSet { recheck.Enabled = true }
if !c.SelfHealingConfig.enabledSet { c.SelfHealingConfig.Enabled = true }
```
Existing operator configs predating this PR carry none of those YAML blocks → enabledSet=false for all three → on next start the LEP-6 dispatcher, recheck attestor, and self-healing healer/verifier/finalizer all flip ON. The chain-wide enforcement-mode gate is the only safety net, and that's a global switch, not an operator-local opt-in. Effect: unannounced gas/fee burn on every upgraded SN, plus surprise RAM-heavy RaptorQ reseeds on tight boxes. Default these to false when omitted, or gate them behind a single explicit lep6.enabled knob.
C2. Local recheck dedup is keyed on (epoch, ticket) but chain dedup is (epoch, ticket, target) — legitimate rechecks dropped
pkg/storage/queries/recheck.go:33 (PRIMARY KEY (epoch_id, ticket_id)), :53 (HasRecheckSubmission), supernode/recheck/finder.go:98 (in-memory seen keyed epoch/ticket)
The schema has the target_account column but doesn't include it in the PK or any lookup. If a ticket has multiple challenged proof results in one epoch (different targets — common when two SNs both hold the artifact), we submit a recheck for the first only and forever mark the (epoch, ticket) "submitted." Second target never gets evidence on chain. This skews chain-side N/R/D math and weakens the cross-checking LEP-6 relies on. Make PK (epoch_id, ticket_id, target_account) and thread target through HasRecheckSubmission / MarkRecheckSubmissionSubmitted / DeletePendingRecheckSubmission and the finder's seen map.
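The composite dedup key C2 asks for can be sketched as follows. This is an in-memory stand-in for the SQLite table, and all type and method names here are hypothetical; only the (epoch_id, ticket_id, target_account) tuple comes from the review.

```go
package main

import "fmt"

// recheckKey mirrors the chain-side dedup tuple suggested in C2:
// (epoch_id, ticket_id, target_account), not just (epoch_id, ticket_id).
type recheckKey struct {
	EpochID       uint64
	TicketID      string
	TargetAccount string
}

type dedup struct{ seen map[recheckKey]bool }

func newDedup() *dedup { return &dedup{seen: map[recheckKey]bool{}} }

// ShouldSubmit returns true exactly once per (epoch, ticket, target),
// so a second challenged target on the same ticket is not dropped.
func (d *dedup) ShouldSubmit(epoch uint64, ticket, target string) bool {
	k := recheckKey{epoch, ticket, target}
	if d.seen[k] {
		return false
	}
	d.seen[k] = true
	return true
}

func main() {
	d := newDedup()
	fmt.Println(d.ShouldSubmit(7, "ticket-x", "sn-a")) // first target: true
	fmt.Println(d.ShouldSubmit(7, "ticket-x", "sn-b")) // second target, same ticket: true
	fmt.Println(d.ShouldSubmit(7, "ticket-x", "sn-a")) // replay: false
}
```

With the two-column key the second call would wrongly return false, which is exactly the dropped-recheck bug described above.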
C3. Healer drops staging dir + dedup row on any unclassified submit error → silent data loss after committed-but-ack-lost tx
supernode/self_healing/healer.go:87-99
```go
if isChainHealOpInvalidState(err) { … return nil } // matches ONE substring
_ = s.store.DeletePendingHealClaim(ctx, op.HealOpId)
_ = os.RemoveAll(stagingDir) // destructive
```
isChainHealOpInvalidState only matches the literal string "does not accept healer completion claim". A lost ack on an actually-committed MsgClaimHealComplete (gRPC Unavailable / DeadlineExceeded / context canceled) is indistinguishable from a real reject, and we wipe the staging dir. Phase 1 doesn't publish to KAD until VERIFIED, so the only copy of the reconstructed bytes is gone. The chain reaches VERIFIED, the finalizer has no claim row → PublishStagedArtefacts is never called → the network never gets the data. On any unclassified submit error, reconcile via GetHealOp (same path as reconcileExistingClaim) before deleting staging.
C4. isChainHealOpNotFound over-matches "not found" substring → wipes live claims on transient query errors
supernode/self_healing/finalizer.go:113-119
```go
return strings.Contains(msg, "not found") || strings.Contains(msg, "not_found")
```
Any error string containing "not found" (gRPC block N not found, codec lookup miss, key-not-found) triggers cleanupClaim(EXPIRED) → os.RemoveAll(stagingDir). Same data-loss surface as C3, driven by query failures. Match status.Code(err) == codes.NotFound plus a typed audittypes sentinel.
C5. Pre-staged pending row blocks all subsequent retries forever after crash between INSERT and submit
supernode/self_healing/service.go:334, healer.go:77-87, verifier.go:127-137
RecordPendingHealClaim writes status='pending' before the chain submit. If the SN dies between INSERT and a successful ClaimHealComplete, on restart HasHealClaim returns true (any status) and dispatchHealerOps skips this op forever. Chain still says SCHEDULED → finalizer's default branch is no-op. Heal-op silently expires; SN is penalised. Same for verifier between RecordPendingHealVerification and SubmitHealVerification → quorum may fail. Either make HasHealClaim/HasHealVerification count only status='submitted', or have the dispatcher resume pending rows.
C6. GetCompoundProof has no per-call cap on len(Ranges) or aggregate bytes → DoS + bulk-exfil channel
supernode/transport/grpc/storage_challenge/handler.go:283-353
The simple GetSliceProof enforces maxServedSliceBytes = 65_536 (line 23). GetCompoundProof validates only "all ranges same size" + "end ≤ ArtifactSize". No cap on len(req.Ranges), no cap on requestRangeLen, no aggregate byte cap. Spec contract is k=4 × 256B = 1 KiB; an authenticated peer can request 1000 ranges × 100 MB or the whole artifact. Both a DoS vector and a bulk-data-exfiltration path that bypasses the cascade access path. Reject len(Ranges) > MaxRanges (e.g. 16), requestRangeLen > 4 × LEP6CompoundRangeLenBytes, and aggregate ≤ 16 KiB.
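The caps C6 proposes can be sketched as a standalone validator. The cap values (MaxRanges=16, 4 × 256 B per range, 16 KiB aggregate) come from the review; the type and function names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// Illustrative caps from the C6 suggestion; the spec contract is
// k=4 ranges × 256 B = 1 KiB per compound challenge.
const (
	maxRanges         = 16
	maxRangeLenBytes  = 4 * 256
	maxAggregateBytes = 16 * 1024
)

type byteRange struct{ Start, End uint64 }

// validateCompoundRequest bounds range count, per-range length, and
// aggregate bytes, in addition to the existing end <= ArtifactSize check.
func validateCompoundRequest(ranges []byteRange, artifactSize uint64) error {
	if len(ranges) == 0 || len(ranges) > maxRanges {
		return errors.New("range count out of bounds")
	}
	var total uint64
	for _, r := range ranges {
		if r.End <= r.Start || r.End > artifactSize {
			return errors.New("range out of bounds")
		}
		l := r.End - r.Start
		if l > maxRangeLenBytes {
			return errors.New("range too long")
		}
		total += l
	}
	if total > maxAggregateBytes {
		return errors.New("aggregate bytes over cap")
	}
	return nil
}

func main() {
	fmt.Println(validateCompoundRequest([]byteRange{{0, 256}, {512, 768}}, 4096))
}
```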
🟠 HIGH
H1. No deadline-epoch check before ClaimHealComplete / SubmitHealVerification / recheck submit → fee burn + wasted CPU/RAM on guaranteed-rejected txs
supernode/self_healing/healer.go:87, verifier.go:137, recheck/attestor.go:41
HealOp.DeadlineEpochId is carried in the proto but never consulted. RaptorQ reseed plus VerifierFetchAttempts=3 × VerifierFetchTimeout=60s + backoff can take 3+ minutes; with storage_truth_heal_deadline_epochs=2 (system tests) the deadline routinely passes mid-flow. The chain rejects past-deadline submissions, and the rejection error string is not in isChainHealOpInvalidState, so the dispatcher retries every poll until status flips — repeat-burning fees. Fetch current epoch (Audit().GetCurrentEpoch) before each submit; if current >= deadline, cleanup + skip.
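The pre-submit guard H1 asks for is trivial to state; a hypothetical sketch (the comparison rule "current >= deadline means skip" is taken from the review's wording):

```go
package main

import "fmt"

// pastDeadline implements the H1 guard: consult HealOp.DeadlineEpochId
// before every ClaimHealComplete / SubmitHealVerification / recheck submit
// instead of burning fees on a guaranteed rejection.
func pastDeadline(currentEpoch, deadlineEpoch uint64) bool {
	return currentEpoch >= deadlineEpoch
}

func main() {
	if pastDeadline(12, 12) {
		fmt.Println("skip submit: deadline epoch reached; clean up staging")
	}
}
```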
H2. Service.listOps requests pagination=nil and never walks next_key → heal-ops silently dropped at scale
supernode/self_healing/service.go:451
```go
resp, err := s.lumera.Audit().GetHealOpsByStatus(queryCtx, status, nil)
```
Cosmos-SDK's default page size is 100. Under load (chain holding many SCHEDULED/HEALER_REPORTED ops), the SN that's the assigned healer or verifier for any op past page 1 will simply never see it — silent missed heal/verification, on-chain penalty, missed quorum. Loop on resp.Pagination.NextKey until empty.
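The NextKey walk can be sketched generically; page and query here are stand-ins for the Cosmos-SDK PageResponse and the Audit().GetHealOpsByStatus call with a PageRequest key.

```go
package main

import "fmt"

type page struct {
	Ops     []string
	NextKey []byte
}

// listAllOps walks pagination until NextKey is empty, so ops past the
// default 100-entry page are not silently dropped.
func listAllOps(query func(key []byte) page) []string {
	var all []string
	var key []byte
	for {
		p := query(key)
		all = append(all, p.Ops...)
		if len(p.NextKey) == 0 {
			return all
		}
		key = p.NextKey
	}
}

func main() {
	pages := map[string]page{
		"":   {Ops: []string{"op-1", "op-2"}, NextKey: []byte("k2")},
		"k2": {Ops: []string{"op-3"}},
	}
	fmt.Println(listAllOps(func(key []byte) page { return pages[string(key)] }))
}
```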
H3. Per-target dispatch failure swallowed — no StorageProofResult emitted → permanent gap in chain N/R/D math
supernode/storage_challenge/lep6_dispatch.go:295-299, plus early-return paths at :351, 372, 376, 380, 394, 403, 475, 479
Many recoverable failures (cascade-meta fetch, ordinal selection, artifact-size resolve, transcript hash, sign) just log.Warn(...); return err from dispatchTicket, with no row appended. Chain interprets absent evidence for an assigned (challenger, target, bucket) slot as missing — no other reporter fills it because the slot is taken. Internal failures should fall through to appendFail(...) with INVALID_TRANSCRIPT (or a new INTERNAL_ERROR class); reserve return err for ctx.Err().
H4. appendFail / appendNoEligible discard sign errors via sig, _ := snkeyring.SignBytes(...) → empty/garbage signature attached to result row
supernode/storage_challenge/lep6_dispatch.go:319-323, 526-530
```go
sig, _ := snkeyring.SignBytes(...)
```
An empty ChallengerSignature on a StorageProofResult will fail chain validation. Per MsgSubmitEpochReport semantics, a single malformed result rejects the entire epoch report → one transient keyring failure poisons the whole epoch's evidence. Propagate the error or skip the entry; never emit a row with an empty signature.
H5. Result-buffer >16 throttle drops by ticket_id lex order, not by age or signal-value
supernode/storage_challenge/result_buffer.go:110-123
Comment claims "drop oldest first" but the code sorts by nonRecent[i].TicketId < nonRecent[j].TicketId. Ticket IDs are content-addressed; lex order has no relation to submission order. When >16 results overflow the chain cap, an attacker who can shape ticket IDs (or just lucky lex order) determines what reaches the chain. There's no per-(target, bucket) fairness either — all dropped slots can hit one bucket, biasing chain-side coverage stats. Either rename honestly to "lex-deterministic-drop", or carry a submission timestamp / round-robin across (target, bucket).
H6. SelectArtifactClass cross-class fallback may not match chain's authoritative class → wrong N/R/D delta routing
pkg/storagechallenge/deterministic/lep6.go:423-440
When the rolled class has zero artifacts, the supernode silently swaps to the other class. Per LEP-6 spec §14, "Symbol vs index hash mismatch artifact-class affects D/N deltas — supernode must report the correct class." If chain re-derives or validates class independently and doesn't mirror this fallback, deltas land in the wrong bucket and the trust multiplier (R/100, Class A pre-recheck only) is misapplied. Either pin a chain-anchored test vector for the fallback or emit NO_ELIGIBLE_TICKET instead of swapping classes.
H7. Verifier streaming has no max-bytes guard → buggy/malicious healer can OOM verifier
supernode/self_healing/peer_client.go:105-120
First message advertises TotalSize, but the verifier never compares accumulated length, never enforces a ceiling, never refuses oversized chunks. Read TotalSize, validate ≤ MaxReconstructedBytes, pre-allocate, and abort if len(buf)+len(msg.Chunk) exceeds TotalSize.
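The bounded-receive loop H7 describes can be sketched as below; the ceiling value and all names are illustrative, and next stands in for the gRPC stream's Recv loop (nil chunk = end of stream).

```go
package main

import (
	"errors"
	"fmt"
)

const maxReconstructedBytes = 1 << 30 // illustrative 1 GiB ceiling

// receiveBounded validates the advertised total, pre-allocates, and
// aborts as soon as accumulated chunks would exceed TotalSize.
func receiveBounded(totalSize uint64, next func() []byte) ([]byte, error) {
	if totalSize == 0 || totalSize > maxReconstructedBytes {
		return nil, errors.New("advertised size out of bounds")
	}
	buf := make([]byte, 0, totalSize)
	for {
		chunk := next()
		if chunk == nil {
			break
		}
		if uint64(len(buf))+uint64(len(chunk)) > totalSize {
			return nil, errors.New("stream exceeds advertised TotalSize")
		}
		buf = append(buf, chunk...)
	}
	return buf, nil
}

func main() {
	chunks := [][]byte{[]byte("abc"), []byte("de"), nil}
	i := 0
	out, err := receiveBounded(5, func() []byte { c := chunks[i]; i++; return c })
	fmt.Println(string(out), err)
}
```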
H8. ServeReconstructedArtefacts doesn't check op.Status → bytes served outside §19 healer-served-path window
supernode/transport/grpc/self_healing/handler.go:126-148
Authorises by healer identity + verifier-set membership. Per §19, the fetch is valid only while op.Status == HEALER_REPORTED. After VERIFIED (artefacts published to KAD), FAILED, or EXPIRED, the staging dir may still exist (cleanup is best-effort) and a former assigned verifier can pull bytes that no longer represent canonical state. Mid-stage (still SCHEDULED) it could serve a partial file. Reject with FailedPrecondition unless op.Status == HEAL_OP_STATUS_HEALER_REPORTED.
H9. Chain-error classification is fragile English-substring matching across the board
supernode/self_healing/healer.go:175-181, verifier.go:162-167, recheck/attestor.go:73-78, finalizer.go:113-119
Idempotency / dedup branches all key on strings.Contains(err.Error(), "...") against literal English chain error strings ("verification already submitted by creator", "recheck evidence already submitted", "does not accept healer completion claim", "not found"). Any chain-side error refactor (sdk version bump, error wrap, i18n) silently flips the branch into the destructive delete pending row + RemoveAll(stagingDir) path and re-submits forever. Match on typed errors via errors.Is against exported audit-module sentinels, or sdk error codes.
🟡 MEDIUM
M1. StagingDir default "heal-staging" is a relative path → process-CWD-dependent
supernode/config/defaults.go:31, supernode/cmd/start.go:301-322, supernode/config.yml:59
applyLEP6DefaultsAndValidate always pre-fills with the relative literal so the withDefaults() ~/.supernode/heal-staging fallback never fires. Under systemd WorkingDirectory=/, multi-GB reconstructed artefacts land in /heal-staging. Resolve through Config.GetFullPath(...) before passing to Service.New, or default to filepath.Join(BaseDir, "heal-staging").
M2. Goroutines in tick use the long-lived Run ctx — wedged ops never release semaphore slots
supernode/self_healing/service.go:344-353, 397-414, 434-442
semReconstruct/semVerify slots and inFlight keys leak forever on hung peer fetches or hung RaptorQ. Wrap with context.WithDeadline derived from op.DeadlineEpochId (or a hard ceiling).
M3. historyStore.CloseHistoryDB runs while LEP-6 services are still draining
supernode/cmd/start.go:413-440
After cancel(), in-flight ClaimHealComplete calls don't honor cancellation immediately. They subsequently call MarkHealClaimSubmitted against a closed DB → error → pending row never cleared → next start reconstructs again. Move CloseHistoryDB after <-servicesErr.
M4. ResolveArtifactSize for INDEX class re-runs cascadekit.GenerateIndexFiles per dispatch — perf hit + cross-version determinism risk
pkg/storagechallenge/lep6_resolution.go:139-153
Two reporters on slightly different cascadekit versions compute different sizes → different ComputeMultiRangeOffsets → different derivation_input_hash → chain treats reports as contradictions. Cache per-ticket; pin cascadekit version explicitly; add a chain-anchored test vector.
M5. RecoveryReseed reads entire reconstructed file into RAM via os.ReadFile to copy to staging
supernode/cascade/reseed.go:256-263, :346
With MaxConcurrentReconstructs=2 and large actions, peak RAM = 2 × file_size on top of the RaptorQ working set. OOM at heal time, exactly when the operator can least afford it. Use io.Copy or os.Rename (same FS).
M6. probeTCP reports DNS / route errors as PORT_STATE_CLOSED
supernode/host_reporter/service.go:348-364
Transient EHOSTUNREACH / DNS failure / ctx.Err() becomes permanent on-chain port-closed evidence. Return PORT_STATE_UNKNOWN for those; only CLOSED on explicit ECONNREFUSED.
M7. dispatchFinalizer skipped entirely under modeGate → staging dirs leak forever after a mode rollback
supernode/self_healing/service.go:276-283
If governance flips StorageTruthEnforcementMode back to UNSPECIFIED while pending claim rows + staging dirs exist, the finalizer never runs. Run dispatchFinalizer regardless of mode; only gate dispatch (healer/verifier) phases.
M8. SQLite ALTER TABLE … ADD COLUMN status runs unguarded on every startup
pkg/storage/queries/self_healing_lep6.go:85,99 (invoked at sqlite.go:401,409)
On a fresh DB the prior CREATE TABLE already has the column → ALTER returns "duplicate column name" → silently swallowed (_, _ =). The swallow also masks real errors (disk full, locked DB). Guard via PRAGMA table_info lookup; only ALTER if column missing; surface real errors. Same for recheck.go:37.
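A sketch of the guard. The column list is passed in directly as a stand-in for scanning `PRAGMA table_info(...)` rows, so the example runs without a SQLite driver; the table and column names are illustrative.

```go
package main

import "fmt"

// migrationFor decides whether an ALTER is needed at all: if the column
// already exists (fresh DB whose CREATE TABLE included it), no statement is
// issued, so no "duplicate column name" error needs swallowing, and any
// error the real ALTER does return can be surfaced.
func migrationFor(existingCols []string, table, col, typ string) (string, bool) {
	for _, c := range existingCols {
		if c == col {
			return "", false // column present: skip, do not swallow
		}
	}
	return fmt.Sprintf("ALTER TABLE %s ADD COLUMN %s %s", table, col, typ), true
}

func main() {
	// Fresh DB: CREATE TABLE already included "status".
	if _, needed := migrationFor([]string{"id", "status"}, "self_healing_lep6", "status", "TEXT"); !needed {
		fmt.Println("skip: column exists")
	}
	// Upgraded DB missing the column: emit exactly one ALTER.
	sql, _ := migrationFor([]string{"id"}, "self_healing_lep6", "status", "TEXT")
	fmt.Println(sql)
}
```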
M9. Restart replay: lastRunEpoch is in-memory only
supernode/storage_challenge/service.go:170, 204
On restart the SN re-dispatches the same epoch even if host_reporter already submitted MsgSubmitEpochReport. Persist lastSubmittedEpoch per key in SQLite.
M10. ticket_provider.go requires BOTH IndexArtifactCount AND SymbolArtifactCount non-zero
supernode/storage_challenge/ticket_provider.go:108
lep6_resolution.go:42-46 says "If both counts are zero (legacy) chain accepts"; chain may also accept one-zero. This filter silently makes such tickets invisible to LEP-6 dispatch. Require at least one non-zero.
M11. Recheck() shadow-swaps the dispatcher's main buffer under a long-held lock
supernode/storage_challenge/lep6_recheck.go:42-48
Lock is held across an entire RPC round-trip to peer, serialising all rechecks and dispatches network-wide on a single SN. Pass the buffer as a parameter; don't mutate shared state under a long lock.
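A short-lock snapshot pattern would look roughly like this (`resultBuffer` is illustrative): copy and clear under the lock, then do the RPC round-trip on the copy with no lock held.

```go
package main

import (
	"fmt"
	"sync"
)

// resultBuffer is a stand-in for the dispatcher's shared buffer.
type resultBuffer struct {
	mu   sync.Mutex
	rows []string
}

// snapshot copies and clears under a short lock; the caller then performs
// the slow RPC against the copy, so rechecks no longer serialise dispatches.
func (b *resultBuffer) snapshot() []string {
	b.mu.Lock()
	defer b.mu.Unlock()
	out := make([]string, len(b.rows))
	copy(out, b.rows)
	b.rows = b.rows[:0]
	return out
}

func main() {
	b := &resultBuffer{rows: []string{"row-a", "row-b"}}
	batch := b.snapshot()
	// ... long RPC round-trip to the peer happens here, lock-free ...
	fmt.Println(len(batch), len(b.rows))
}
```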
M12. sequenceMismatchMaxAttempts = 3 hard-coded, not threaded through TxHelperConfig
pkg/lumera/modules/tx/helper.go:19
Three new high-rate tx surfaces (ClaimHealComplete, SubmitHealVerification, SubmitStorageRecheckEvidence) inherit a non-tunable cap. Operators can't tune under chain congestion. Plumb through TxHelperConfig with the same MaxGasAdjustmentAttemptsCap-style hard ceiling enforced in both applyTxHelperDefaults AND UpdateConfig (per the supernode tx-helper safety-cap-mirroring rule).
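A sketch of the plumbing with the cap mirrored in the defaulting path; the field name, default of 3 (today's hard-coded value), and cap value are assumptions about how the real `TxHelperConfig` would carry it.

```go
package main

import "fmt"

// sequenceMismatchAttemptsCap is a hypothetical hard ceiling, enforced the
// same way MaxGasAdjustmentAttemptsCap is (in defaults AND UpdateConfig).
const sequenceMismatchAttemptsCap = 10

type TxHelperConfig struct {
	SequenceMismatchMaxAttempts int
}

func applyTxHelperDefaults(c *TxHelperConfig) {
	if c.SequenceMismatchMaxAttempts <= 0 {
		c.SequenceMismatchMaxAttempts = 3 // current hard-coded value becomes the default
	}
	if c.SequenceMismatchMaxAttempts > sequenceMismatchAttemptsCap {
		c.SequenceMismatchMaxAttempts = sequenceMismatchAttemptsCap // clamp operator input
	}
}

func main() {
	c := &TxHelperConfig{SequenceMismatchMaxAttempts: 99}
	applyTxHelperDefaults(c)
	fmt.Println(c.SequenceMismatchMaxAttempts) // clamped to the cap
}
```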
🟢 LOW
- L1. `Server.NewServer` accepts a nil resolver → in tests / silent wiring drift, the handler trusts user-supplied `req.VerifierAccount`. Reject nil outright in the production constructor. (transport/grpc/self_healing/handler.go:118-120)
- L2. `negativeAttestationHash` mixes raw `fetchErr.Error()` text → per-verifier-unique placeholder hash. Hash a small canonical reason taxonomy instead. (self_healing/verifier.go:62-63, 116-119)
- L3. `RecordPendingRecheckSubmission` uses `INSERT OR IGNORE` → silently masks duplicate-attempt scenarios; we then submit anyway and the chain rejects. (pkg/storage/queries/recheck.go:74)
- L4. Recheck `finder` returns whole-tick failure on the first per-reporter RPC error → one unreachable reporter masks every other reporter's candidates. (recheck/finder.go:70-81)
- L5. `appendNoEligible` discards the actually-selected ticket id → loses observability and merges two distinct error modes into the same row shape. (storage_challenge/lep6_dispatch.go:358-361)
- L6. `tests/system/config.lep6-1.yml` has `storage_challenge.enabled: false` but `lep6.recheck.enabled: true` → dead recheck block; a foot-gun for anyone copying the template.
✅ Audited and clean
- `audit_msg/impl.go` — uses the standard `txHelper.ExecuteTransaction` for all 5 methods; both the legacy `NewModule` and `NewModuleWithTxHelperConfig` constructors are preserved (public-API rule honored). `applyTxHelperDefaults` + `UpdateConfig` both enforce `MaxGasAdjustmentAttemptsCap` (mirroring rule honored).
- `pkg/metrics/lep6` — labels are bounded enum strings; no per-ticket / per-account labels → no cardinality blow-up.
- Hash determinism along the healer↔verifier path: both compute `cascadekit.ComputeBlake3DataHashB64` over the full reconstructed file; the healer derives the manifest hash from `meta.DataHash` after `VerifyB64DataHash` succeeded against the same recipe.
- §19 healer-served-path identity check in `handler.go` correctly prefers secure-RPC identity over `req.VerifierAccount` for caller resolution (when the resolver is wired).
- Empty-verifier-set bypass: not present on the supernode side — quorum is chain-side; the SN never auto-finalizes.
- `pkg/netutil/hostport.go` — IPv6 brackets, zone IDs, and malformed inputs handled without panic.
- `supernode/config/save.go` — config file written `0600`, dir `0700`. No secrets logged.
Test-coverage gaps that would have caught the above at CI
- No test for `SelectArtifactClass` cross-class fallback against a chain-anchored vector (H6).
- No test for >16 result-buffer throttle ordering / fairness (H5).
- No restart-replay scenario for either the heal-claim pending row or storage-challenge `lastRunEpoch` (C5, M9).
- No test for multi-target same-ticket recheck dedup (C2).
- No test for chain-error classification — every "is-already-submitted" / "is-not-found" branch is on a hand-typed substring with no fixture pinning the exact chain error.
- No test for `GetCompoundProof` adversarial range payloads (C6).
Verdict
REQUEST CHANGES. Six CRITICAL, nine HIGH. The two highest-impact classes are (1) silent data-loss paths in self-healing driven by ordinary transient errors (C3, C4, C5, H9) and (2) chain-side N/R/D math being silently wrong because of (a) the recheck dedup PK gap (C2), (b) swallowed dispatch errors (H3), (c) malformed-signature row emission (H4), and (d) the cross-class fallback non-determinism (H6). The auto-opt-in default (C1) and unbounded GetCompoundProof (C6) are operator-impact / DoS issues that should not ship as-is.
Most of these are localised fixes — the biggest structural one is replacing English-substring chain-error matching with typed sentinels, which is one diff that closes most of the data-loss surface.
Resolves all 33 items from mateeullahmalik's CHANGES_REQUESTED review on PR #286. Per Matee's lens — silent data-loss, chain N/R/D math fragility, operator-impact / fee-burn, DoS / bulk-exfil on authenticated handlers, English-substring chain-error matching — every finding is closed without spec divergence and with regression tests.

Highlights by failure class:

1. Typed chain-error sentinels (H9 umbrella; foundation for C3/C4/C5/H1/L3)
   - New pkg/lumera/chainerrors with predicates (errors.Is + gRPC code + substring fallback) and transient short-circuit; replaces every strings.Contains(err.Error(), …) under self_healing / recheck / storage_challenge.
2. Storage layer (C2, M8, M12, L3, L4)
   - Recheck dedup migrated to a (heal_op_id, target_supernode) PK with typed ErrAlreadyExists, INSERT … ON CONFLICT DO NOTHING.
   - PRAGMA-guarded ALTER TABLE migrations; tx-helper sequence cap and fairness; finder per-reporter error isolation.
3. Self-healing safety overhaul (C3, C4, C5, H1, H2, H7, M2, M5, M7, L2)
   - Reconcile-not-purge for transient errors; pre-submit deadline-epoch check; paginated GetHealOpsByStatus; bounded streaming caps; per-op deadline goroutines; finalizer runs regardless of mode-gate; reseed via os.Rename / io.Copy; canonical negative-attestation reason taxonomy.
4. Storage-challenge dispatch + buffer (C2, H3, H4, H5, H6, M4, M9, M10, M11, L5)
   - Chain-anchored partial rows for pre-derivation early returns (ctx.Err passthrough); sign errors drop the row + metric, never lie; arrival-order + (target, bucket) fairness buffer (no lex shaping); no class swap when the rolled class is empty (NO_ELIGIBLE_TICKET only); bounded LRU index-size cache; SQLite-persisted lastSubmittedEpoch; at-least-one-class-non-zero ticket gate; per-call recheck buffer.
5. Operator config / shutdown / probing (C1, M1, M3, M6, L6)
   - LEP-6 toggles default to FALSE on missing config; startup advisory WARN names each disabled service; structural validator rejects recheck=true with parents disabled; staging dir resolved via GetFullPath; historyStore.CloseHistoryDB moved after services drain; probeTCP taxonomy distinguishes ECONNREFUSED (CLOSED) from DNS / EHOSTUNREACH / ctx.Err / Timeout (UNKNOWN); fixtures aligned with the new gating chain.
6. Transport handlers (C6, H8, L1)
   - GetCompoundProof: per-call MaxCompoundRanges=16, per-range cap 4×LEP6CompoundRangeLenBytes, MaxCompoundAggregateBytes=16 KiB; rejected before any artifact bytes are read.
   - ServeReconstructedArtefacts: gated on op.Status == HealOpStatus_HEAL_OP_STATUS_HEALER_REPORTED.
   - NewServer rejects a nil resolveCaller; NewServerForTest is the documented test-only escape hatch.

Spec-fidelity: no scoring constants changed; no chain-side semantics altered. Chain-anchored validator rules cited at chain path:line for every consensus-affecting branch (PK shape, partial-row class, dispatch class fallback, deadline sentinel).

Validation: go build, go vet, focused per-wave package tests, and a full go test $(go list ./... | grep -v /tests) -count=1 sweep — zero regressions across 50+ packages.
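The GetCompoundProof bounds check described above can be sketched as follows. The caps come from the summary (16 ranges, 16 KiB aggregate); the per-range base constant's value and all type/function shapes are placeholders, not the real handler API.

```go
package main

import "fmt"

const (
	maxCompoundRanges         = 16
	lep6CompoundRangeLenBytes = 1024 // placeholder value for the real constant
	maxCompoundAggregateBytes = 16 * 1024
)

type byteRange struct{ Offset, Length int64 }

// validateCompoundRequest rejects oversized or malformed requests before
// any artifact bytes are read from disk.
func validateCompoundRequest(ranges []byteRange) error {
	if len(ranges) > maxCompoundRanges {
		return fmt.Errorf("too many ranges: %d > %d", len(ranges), maxCompoundRanges)
	}
	var total int64
	for _, r := range ranges {
		if r.Length <= 0 || r.Length > 4*lep6CompoundRangeLenBytes {
			return fmt.Errorf("range length %d out of bounds", r.Length)
		}
		total += r.Length
		if total > maxCompoundAggregateBytes {
			return fmt.Errorf("aggregate %d bytes exceeds %d", total, maxCompoundAggregateBytes)
		}
	}
	return nil
}

func main() {
	ok := []byteRange{{Offset: 0, Length: 512}, {Offset: 4096, Length: 512}}
	fmt.Println(validateCompoundRequest(ok)) // accepted

	huge := make([]byteRange, maxCompoundRanges+1)
	for i := range huge {
		huge[i] = byteRange{Offset: 0, Length: 1}
	}
	fmt.Println(validateCompoundRequest(huge) != nil) // rejected before any I/O
}
```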